Document creation/reading method, document creation/reading device, document creation/reading robot, and document creation/reading program

ABSTRACT

To enable a person to efficiently create a document based on image data or audio data of a recorded meeting or lecture, and to enable the person who creates the minutes of a meeting, or a participant, to browse a summarized document together with images or voices, so that a plurality of persons can efficiently perform documentation.
     The audio/image inputting means 10 generates image data by recording scenes of a meeting and audio data by recording the contents of the meeting. The document inputting means 20 generates document data including a draft of the minutes of the meeting or the like inputted by the person who creates the minutes. The relationship deriving means 50 generates correspondence table data by deriving the relationship between the voices or images and the document based on the audio data or image data and the document data. The relationship presenting means 60 displays the voices or images and the document in association with each other based on the correspondence table data.

TECHNICAL FIELD

The present invention generally relates to a technique of using multimedia, and more specifically to a documentation browsing method, a documentation browsing apparatus, a documentation browsing robot and a documentation browsing program, which enable a user to create and browse a document by associating audio data or image data of a recorded meeting or lecture with the created document in order to create a document on the meeting or the lecture.

BACKGROUND ART

Scenes or audio of a meeting or a lecture may be recorded with apparatus such as a video recorder or a tape recorder so that documentation of the minutes of the meeting or the lecture may be supported. In such a case, documentation is done based on the recorded images or audio. A person who creates the minutes of a meeting or a lecture needs to create the minutes by playing back and watching the images or listening to the audio after the meeting or the lecture. That requires the person to create the minutes by repeatedly referring to the images or the audio, spending as much time as, or longer than, the duration of the meeting or the lecture itself. Accordingly, the workload on that person is heavy.

As a technique for creating, or supporting the creation of, the minutes of a meeting or the like, Patent Document 1 describes an apparatus for supporting the creation of the minutes of a meeting which alleviates the burden on the person who creates the minutes by specifying a range of audio to be played back when the person makes a text memorandum from an audio memorandum recording the remarks of those present at the meeting. The apparatus described in Patent Document 1 includes playback means for playing back sounds or voices so that participants in a predetermined range can hear them. Documentation means of the apparatus creates a text memorandum based on the played-back audio memorandum and structures the text memorandum according to a specification. Displaying means of the apparatus displays the structured text memorandum.

Patent Document 2 describes an intelligent meeting support system for enabling automatic creation of the minutes of a meeting without placing any burden on participants. In the intelligent meeting support system described in Patent Document 2, an apparatus for creating the minutes of a meeting converts audio text data into sentences including Chinese characters by performing semantic analysis on the basic audio data outputted from an audio collecting device. The apparatus then creates the minutes of the meeting based on the converted sentences.

Patent Document 3 describes a system for automatically creating the minutes of a meeting which can create the minutes automatically in real time and send them immediately after the meeting. In the system described in Patent Document 3, the apparatus for creating the minutes of a meeting automatically creates a document based on image information or audio information by using a speech recognition technique. In order to create the minutes, an operator such as an MC divides the audio meeting information into appropriate sections and inputs a keyword for each section at the same time, so that the accuracy of speech recognition is improved; the audio data of each section defined by the operator is converted into text data and the minutes of the meeting are created.

Patent Document 1: Japanese Patent Laid-Open Publication No. 8-194492 (pp. 4-6, Figures 1-3)

Patent Document 2: Japanese Patent Laid-Open Publication No. 2000-112931 (pp. 3-9, Figures 1-3)

Patent Document 3: Japanese Patent Laid-Open Publication No. 2002-344636 (pp. 3-5, Figure 1)

DISCLOSURE OF THE INVENTION

The apparatus for supporting the creation of the minutes of a meeting described in Patent Document 1 avoids disturbing the meeting because a playback range is specified and the audio memorandum is played back only loudly enough to reach predetermined participants. If the content of the audio memorandum is unclear during preparation of the minutes, the participants can review the content and correctly turn the remarks into sentences, so that they can quickly create high-quality minutes with clear relationships between remarks. However, the apparatus still requires the person who creates the minutes to do so by repeatedly referring to the recorded audio of the meeting. That is still a heavy burden on that person.

The intelligent meeting support system described in Patent Document 2 and the system for automatically creating the minutes of a meeting described in Patent Document 3 can reduce the burden on the person who creates the minutes by creating them automatically. However, since these systems depend on speech recognition for creating text, if the speech recognition is not accurate enough, the created text may be erroneous. Users must then rework the errors into understandable expressions, resulting in a higher cost for creating a document. In particular, as recording conditions are poor in a big conference hall or a big lecture hall, the systems have a problem with the accuracy of recognizing sounds or voices.

For a debate or the like in the Diet or a city assembly, the minutes are usually created to follow the remarks closely. In contrast, for a meeting held in a private company, the minutes are usually created as a summary containing only the remarks which are meaningful to participants, instead of recording the proceedings or all the remarks of the participants. In such a case, it is not enough to create the minutes just by converting audio data into text data with speech recognition. That is to say, it is difficult to automate the generation of compact minutes of a meeting or lecture including a summary, because the systems cannot choose only the meaningful parts by using automatic recognition techniques.

An operation that summarizes only the parts meaningful to the participants may require all the participants to review the summarized contents. In such a case, it is difficult for each participant to determine which items of a proceeding or which remarks should be chosen just by reading the document generated by speech recognition. Therefore, the document may not be corrected.

In order to improve the accuracy of speech recognition, a sophisticated microphone may be used to collect voices. However, that requires a user to set up an input device/analysis device independently for each speaker, which increases the size of the entire system and results in a cost increase.

Therefore, the present invention aims to solve the above-mentioned problems, and intends to provide a documentation browsing method, a documentation browsing apparatus, a documentation browsing robot and a documentation browsing program, which can efficiently create a document based on the recorded image data or audio data of a meeting or a lecture, and enable a person who creates a document, or participants, to read the summarized document with images or voices so that a plurality of persons can efficiently create a document.

The present invention also intends to provide a documentation browsing method, a documentation browsing apparatus, a documentation browsing robot and a documentation browsing program, which can create a document at low cost by enabling a person to edit or add to a document which was created based on the memorandum of a participant during the meeting or the lecture, without requiring the skill of a stenographer or superfluous costs even for the Diet or a celebrity's lecture.

The present invention further intends to provide a documentation browsing method, a documentation browsing apparatus, a documentation browsing robot and a documentation browsing program, which allow a user not only to create and browse minutes which are a faithful reproduction of the remarks in a meeting or a lecture, but also to create and browse a document which summarizes the content of the meeting or the lecture for the user's confirmation.

Further, the present invention intends to provide a documentation browsing method, a documentation browsing apparatus, a documentation browsing robot and a documentation browsing program, which allow a user to create and browse a document by using speech recognition even if the speech recognition is not accurate enough.

A documentation browsing method according to the present invention includes the steps of: generating correspondence between voices or images included in audio data or image data and a document included in document data; displaying the voices or images included in the audio data or image data and the document included in the document data in association with each other based on the correspondence; and updating the document data associated with the displayed document.

The documentation browsing method may also include the steps of: dividing voices or images included in audio data or image data and a document included in document data into predetermined sections and generating correspondence between the voices or images and the document for each section; and generating, adding, correcting or deleting a document for each section. Such a configuration enables a user to select and edit only the section of the entire document which the user wants to correct. Accordingly, the documentation browsing method can streamline the user's work to create a document.

The documentation browsing method may also include the steps of: establishing a matching between a document and voices or images by matching text generated by speech recognition or image recognition against a document included in document data; and generating correspondence based on the established matching. Such a configuration enables words or phonemes to be extracted and compared, and the relationship between a document and voices or images to be derived.

The documentation browsing method may also include the steps of: extracting basic words, which serve as basic units for matching text and a document, from the words included in each section of the text and the document; calculating and comparing similarities between groups of basic words, which are sets of the extracted basic words, and establishing a matching between the document and the voices or images; and generating correspondence based on the established matching. Such a configuration can compensate for the accuracy of speech recognition even when a lecture hall has poor conditions for collecting sound, and can reduce the size of the entire system and the cost of equipment.
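
The method does not fix a particular similarity measure between groups of basic words. As one minimal sketch (the measure and the function name are chosen here purely for illustration, not taken from the invention), the similarity between the set of basic words of a candidate text section and that of a document section could be computed as a Jaccard index:

    def basic_word_similarity(text_words, document_words):
        # Jaccard index between the basic words of a candidate text section
        # and those of a document section; both arguments are iterables of words.
        a, b = set(text_words), set(document_words)
        if not (a or b):
            return 0.0
        return len(a & b) / len(a | b)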

A documentation browsing apparatus according to the present invention is characterized in that it includes: correspondence generating means for generating correspondence between voices or images included in audio data or image data and a document included in document data; association displaying means for displaying the voices or images included in the audio data or image data and the document included in the document data in association with each other based on the correspondence; and document updating means for updating the document data associated with the document displayed by the association displaying means.

The correspondence generating means may divide the voices or images included in the audio data or image data and the document included in the document data into predetermined sections and generate correspondence between the voices or images and the document for each section; and the document updating means may generate, add, correct or delete a document for each section. Such a configuration enables a user to select and edit only the section of the entire document which the user wants to correct. Accordingly, the means can streamline the user's work to create a document.

The documentation browsing apparatus may include playback means for playing back voices or images, wherein the correspondence generating means may divide the voices or images included in the audio data or image data and the document included in the document data into predetermined sections and generate correspondence between the voices or images and the document for each section; the document updating means may generate, add, correct or delete a document for each section; and the playback means may play back the voices or images of a part associated with at least one section included in the document data among the sounds or images included in the audio data or image data. Such a configuration enables a user to select only the sounds or images of a section whose contents the user wants to check in the document and play back the audio or images to check the contents. Accordingly, the documentation browsing apparatus can streamline the user's work to create a document.

In the documentation browsing apparatus, the document data may include a document of the minutes of a meeting, a lecture or a lesson, and the audio data or image data may include voices or images of the recorded contents of the meeting, the lecture or the lesson. Such a configuration can streamline documentation of the minutes of a meeting, a lecture or a lesson.

The correspondence generating means may extract a starting time and an ending time of a section of the voices or images and generate association information which associates the starting time and the ending time with the document data. Such a configuration can display and present to the user, based on the association information, the starting time and the ending time of the voices or images and the associated document.
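
For illustration only, one possible form of such association information is a table whose rows pair the starting and ending times of an audio/image section with the associated document section (the field names below are assumptions, not terms from the embodiment):

    from dataclasses import dataclass

    @dataclass
    class AssociationEntry:
        start_time: float   # starting time of the audio/image section, in seconds
        end_time: float     # ending time of the audio/image section, in seconds
        doc_section: int    # index of the associated section in the document data

    # e.g. the third document section covers the recording from 64.2 s to 131.0 s
    entry = AssociationEntry(start_time=64.2, end_time=131.0, doc_section=2)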

The association displaying means may display time information which indicates the elapsed time of the voices or images associated with each section of the document data. Such a configuration can visually present the elapsed time of the voices or images to the user.

The association displaying means may also display the displaying location of a document included in the document data on a display screen in association with time information which indicates the elapsed time of the voices or images. Such a configuration can visually present to the user the elapsed time of the voices or images in association with the associated document.

The association displaying means may display each section of the document data on a display screen with a length proportional to the playback time of the voices or images associated with that section. Such a configuration enables a user to intuitively recognize and understand the time spent on each item in a meeting or the like.

The association displaying means may display each section of the document data on a display screen with a predetermined length. Such a configuration enables a user to intuitively understand the number of sections of the voices, the images or the document, or the number of all sections.

The association displaying means may display each section of the document data on a display screen with a length proportional to the amount of text in the document associated with that section. Such a configuration can increase the display density of the document, and also enables a user to browse or create a document compactly as a whole.

The documentation browsing apparatus may include display type selection means for selecting a type of display for the length of each section of the document data on a display screen, wherein the display type selection means may select, according to a user's selecting instruction, a display type of a length proportional to the playback time of the audio or images associated with each section, a predetermined length, or a length proportional to the amount of text in the document associated with each section; and the association displaying means may display each section according to the display type selected by the display type selection means. Such a configuration enables a user to freely select the display type of a document according to the usage of the document and to perform documentation efficiently.

The association displaying means may display the time length of the voices or images by the length of a display bar indicating a time axis on a display screen, and display the time length of the voices or images and the characters in the document of the section associated with those voices or images in the same color. Such a configuration makes it easy for a user to visually recognize the relationship between the voices or images and the document for each section.

The documentation browsing apparatus may include mismatch detecting means for detecting a case where voices or images and a document are not associated with each other as a mismatch state of the audio or images and the document, and mismatch displaying means for displaying that the audio or images and the document mismatch when the mismatch is detected, wherein the mismatch displaying means may display, as a mismatch state, that no document section associated with a section of the voices or images exists, or that no section of the voices or images associated with a section of the document exists. Such a configuration enables a user to correctly recognize the relationship between voices or images and a document even when a part of the voices or images has no associated document, or a part of the document has no associated voices or images.

The documentation browsing apparatus may include matching means for establishing a matching between a document and voices or images, wherein the correspondence generating means may generate correspondence based on the established matching.

The documentation browsing apparatus may include document storage means for storing document data and audio and image storage means for storing audio data or image data. Such a configuration can temporarily accumulate the recorded data of a meeting or a lecture and create a document of the minutes of the meeting or the lecture afterward.

The matching means may establish a matching between a document and voices or images by matching text generated by speech recognition or image recognition against a document included in the document data. Such a configuration can derive the relationship between a document and voices or images by extracting and comparing words or phonemes.

The matching means may match the text and the document by using a dynamic programming matching method on the words appearing in the text or the document. Such a configuration can associate voices or images with a document through efficient calculation that reuses already calculated partial results of the relationship.
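
The invention names dynamic programming matching without committing to one formulation. A minimal edit-distance-style sketch over word sequences, which reuses previously calculated partial results as the text describes, could look as follows (illustrative only, not the claimed implementation):

    def dp_match_cost(text_words, doc_words):
        # Minimum alignment cost between a recognized word sequence and a
        # document word sequence; cost[i][j] aligns the first i text words
        # with the first j document words, so each cell reuses earlier results.
        n, m = len(text_words), len(doc_words)
        cost = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            cost[i][0] = i
        for j in range(1, m + 1):
            cost[0][j] = j
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                same = 0 if text_words[i - 1] == doc_words[j - 1] else 1
                cost[i][j] = min(cost[i - 1][j] + 1,        # skip a text word
                                 cost[i][j - 1] + 1,        # skip a document word
                                 cost[i - 1][j - 1] + same)  # match / substitute
        return cost[n][m]

Backtracking through the table would yield the word-level association itself rather than just its cost.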

The association displaying means may display, in addition to the document included in the document data, the text associated with the document among the texts generated by the matching means. Such a configuration makes it easy for a user to recognize contents of a meeting which are not described in the document, even when a part of the voices or images has no associated description in the document.

The documentation browsing apparatus may include basic word extracting means for extracting basic words, which serve as basic units for matching text and a document, from the words included in each section of the text and the document, wherein the matching means may establish a matching between the document and the voices or images by calculating and comparing similarities between groups of basic words, which are sets of the basic words extracted by the basic word extracting means. Such a configuration can compensate for the accuracy of speech recognition and reduce the size of the entire system and the cost of equipment even when a lecture hall has poor conditions for collecting sound.

The matching means may match the text and the document by using a dynamic programming matching method based on groups of basic words including the basic words extracted by the basic word extracting means. Such a configuration can associate voices or images with a document through efficient calculation that reuses already calculated partial results of the relationship based on the basic words.

The association displaying means may display, in addition to the document included in the document data, a basic word associated with the document among the basic words extracted by the basic word extracting means. Such a configuration enables a user to immediately recognize the contents described in each section.

The matching means may include relationship correction means for correcting the derived relationship according to a user's operation. Such a configuration enables a user to perform documentation efficiently by correcting the relationship even when an erroneous relationship between voices or images and a document is presented.

The documentation browsing apparatus may include relationship recalculation instruction means for outputting recalculation instruction information which instructs recalculation of the relationship between a document and voices or images, and causing the matching means to recalculate the relationship. Such a configuration can recalculate the relationship between the voices or images and the document and present the result to a user when the document is edited and updated.

The documentation browsing apparatus may include document update determination means for determining whether the document updating means has updated the document data or not, wherein the relationship recalculation instruction means may output recalculation instruction information and cause the matching means to recalculate the relationship when the document data is determined to have been updated. Such a configuration can detect that a document has been updated, recalculate the relationship between the voices or images and the document, and present the result to a user.

The documentation browsing apparatus may include outputting means for outputting a document based on the document data. Such a configuration can print and output a document such as the completed minutes of a meeting or a lecture.

A documentation browsing robot according to the present invention is characterized by having the documentation browsing apparatus.

A documentation browsing program according to the present invention is characterized by causing a computer to execute the processes of: generating correspondence between voices or images included in audio data or image data and a document included in document data; displaying the audio or images included in the audio data or image data and the document included in the document data in association with each other based on the correspondence; and updating the document data associated with the displayed document.

The documentation browsing program may also cause a computer to execute the processes of: dividing voices or images included in audio data or image data and a document included in document data into predetermined sections and generating correspondence between the voices or images and the document for each section; and generating, adding, correcting or deleting a document for each section. Such a configuration enables a user to select and edit only the section of the entire document which the user wants to correct. Accordingly, the documentation browsing program can streamline the user's work to create a document.

The documentation browsing program may also cause a computer to execute the processes of: dividing voices or images included in audio data or image data and a document included in document data into predetermined sections and generating correspondence between the voices or images and the document for each section; generating, adding, correcting or deleting a document for each section; and playing back the voices or images of a part associated with at least one section included in the document data among the audio or images included in the audio data or image data. Such a configuration enables a user to select only the voices or images of a section whose contents the user wants to check in the document and play back the audio or images to check the contents. Accordingly, the documentation browsing program can streamline the user's work to create a document.

The documentation browsing program may also cause a computer to execute the processes based on document data including a document of the minutes of a meeting, a lecture or a lesson, and on audio data or image data including voices or images of the recorded contents of the meeting, the lecture or the lesson. Such a configuration can streamline documentation of the minutes of a meeting, a lecture or a lesson.

The documentation browsing program may also cause a computer to execute the process of extracting a starting time and an ending time of a section of the voices or images and generating association information which associates the starting time and the ending time with the document data. Such a configuration can display and present to the user, based on the association information, the starting time and the ending time of the voices or images and the associated document.

The documentation browsing program may also cause a computer to execute the process of displaying time information which indicates the elapsed time of the voices or images associated with each section of the document data. Such a configuration can visually present the elapsed time of the voices or images to the user.

The documentation browsing program may also cause a computer to execute the process of displaying the displaying location of a document included in the document data on a display screen in association with time information which indicates the elapsed time of the voices or images. Such a configuration can visually present to the user the elapsed time of the voices or images in association with the associated document.

The documentation browsing program may also cause a computer to execute the process of displaying each section of the document data on a display screen with a length proportional to the playback time of the voices or images associated with that section. Such a configuration enables a user to intuitively recognize and understand the time spent on each item in a meeting or the like.

The documentation browsing program may also cause a computer to execute the process of displaying each section of the document data on a display screen with a predetermined length. Such a configuration enables a user to intuitively understand the number of sections of the voices, the images or the document displayed on the screen, or the number of all sections.

The documentation browsing program may cause a computer to execute the process of displaying each section of the document data on a display screen with a length proportional to the amount of text in the document associated with that section. Such a configuration can improve the display density of the document, and also enables a user to browse or create a document compactly as a whole.

The documentation browsing program may also cause a computer to execute the processes of: selecting, according to a user's instruction, a display type for the length of each section of the document data on a display screen from among a length proportional to the playback time of the audio or images associated with each section, a predetermined length, and a length proportional to the amount of text in the document associated with each section; and displaying each section according to the selected display type. Such a configuration enables a user to freely select the display type of a document according to the usage of the document and to perform documentation efficiently.

The documentation browsing program may cause a computer to execute the process of displaying the time length of the voices or images by the length of a display bar indicating a time axis on a display screen, and displaying the time length of the voices or images and the characters in the document of the section associated with those voices or images in the same color. Such a configuration makes it easy for a user to visually recognize the relationship between the voices or images and the document for each section.

The documentation browsing program may cause a computer to execute the process of detecting a case where voices or images and a document are not associated with each other as a mismatch state of the audio or images and the document, and displaying, when the mismatch is detected, that no document section associated with a section of the voices or images exists, or that no section of the voices or images associated with a section of the document exists, as a mismatch state. Such a configuration enables a user to correctly recognize the relationship between voices or images and a document even when a part of the voices or images has no associated document, or a part of the document has no associated voices or images.

The documentation browsing program may cause a computer to execute the process of establishing a matching between a document and voices or images, and generating correspondence based on the established matching.

The documentation browsing program may cause a computer to execute the process of causing document storage means for storing document data to store document data generated based on a document inputted by a user, and causing audio and image storage means for storing audio data or image data to store audio data or image data generated from recorded dialogue. Such a configuration can temporarily accumulate the recorded data of a meeting or a lecture and create a document of the minutes of the meeting or the lecture after the meeting or the lecture.

The documentation browsing program may cause a computer to execute the process of deriving the association between a document and voices or images by matching text generated by speech recognition or image recognition against a document included in the document data. Such a configuration can derive the relationship between a document and voices or images by extracting and comparing words or phonemes.

The documentation browsing program may cause a computer to execute the process of matching the text and the document by using a dynamic programming matching method on the words appearing in the text or the document. Such a configuration can associate voices or images with a document through efficient calculation that reuses already calculated partial results of the association.

The documentation browsing program may cause a computer to execute the process of displaying, in addition to the document included in the document data, the text associated with the document among the generated texts. Such a configuration makes it easy for a user to recognize contents of the meeting which are not described in the document, even when a part of the voices or images has no associated description in the document.

The documentation browsing program may cause a computer to execute the process of extracting basic words, which serve as basic units for matching text and a document, from the words included in each section of the text and the document, and establishing a matching between the document and the voices or images by calculating and comparing similarities between groups of basic words, which are sets of the extracted basic words. Such a configuration can compensate for the accuracy of speech recognition and reduce the size of the entire system and the cost of equipment even when a lecture hall has poor conditions for collecting sound.

The documentation browsing program may cause a computer to execute the process of matching the text and the document by using a dynamic programming matching method based on groups of basic words including the extracted basic words. Such a configuration can associate voices or images with a document through efficient calculation that reuses already calculated partial results of the relationship based on the basic words.

The documentation browsing program may cause a computer to execute the process of displaying, in addition to the document included in the document data, a basic word associated with the document among the extracted basic words. Such a configuration enables a user to immediately recognize the contents described in each section.

The documentation browsing program may cause a computer to execute the process of correcting the derived relationship according to a user's operation. Such a configuration enables a user to perform documentation efficiently by correcting the relationship even when an erroneous relationship between voices or images and a document is presented.

The documentation browsing program may cause a computer to execute the process of outputting recalculation instruction information which instructs recalculation of the relationship between a document and voices or images, and recalculating the relationship. Such a configuration can recalculate the relationship between the voices or images and the document and present the result to a user when the document is edited and updated.

The documentation browsing program may cause a computer to execute the process of determining whether the document data has been updated or not, and outputting recalculation instruction information and recalculating the relationship when the document data is determined to have been updated. Such a configuration can detect that a document has been updated, recalculate the relationship between the voices or images and the document, and present the result to a user.

The documentation browsing program may cause a computer to execute the process of outputting a document based on the document data. Such a configuration can print and output a document such as the completed minutes of a meeting or a lecture.

According to the present invention, the documentation browsing method, the documentation browsing apparatus, the documentation browsing robot and the documentation browsing program display voices or images and a document in association with each other based on the generated correspondence, and update the document data associated with the displayed document. Therefore, the present invention can streamline a user's documentation, as the user can edit a document by checking a display screen and playing back only the voices or images of the particular places required for documentation, without playing back the entire voices or images. If a plurality of users create a document, each of the users can easily play back and browse the grounds for creating the document, that is, the voices or images associated with a particular place. Accordingly, all the users can immediately recognize the contents of the document, which streamlines documentation performed by a plurality of users. Therefore, the present invention can efficiently create a document based on image information or audio information of a recorded meeting or lecture. With the present invention, the persons who create a document or the participants can browse a summarized document with images or voices, and a plurality of persons can perform documentation efficiently.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram showing an example of an arrangement of documentation browsing apparatus according to an embodiment;

FIG. 2 is a flowchart showing an example of the minutes of a meeting creating process and the minutes of a meeting browsing process;

FIG. 3 is a block diagram showing an example of an arrangement of the relationship deriving means 50;

FIG. 4 is a block diagram showing an example of an arrangement of the speech recognition means 22;

FIG. 5 is a block diagram showing an example of an arrangement of the candidate text/document associating means 23;

FIG. 6 is an illustration schematically showing operation of the process of associating sections performed by the candidate text/document associating means 23;

FIG. 7 is an illustration showing an example where association between an image or audio and a document is performed by speech recognition matching;

FIG. 8 is an illustration showing an example of associating an image or audio and a document by matching at the level of the word lines used in a section;

FIG. 9 is an illustration schematically showing an example of association between an image or audio and a document, and correspondence table data;

FIG. 10 is a block diagram showing another example of an arrangement of the relationship deriving means 50;

FIG. 11 is an illustration showing an example of a relationship display screen between voices/images and a document;

FIG. 12 is an illustration showing another example of a method for displaying a time axis for displaying a document and an associated image;

FIG. 13 is a block diagram showing an example of an arrangement of the relationship presenting means 60;

FIG. 14 is a block diagram showing an example of an arrangement of the relationship presenting means 60;

FIG. 15 is an illustration showing another example of a relationship display screen between voices/images and a document;

FIG. 16 is an illustration showing yet another example of a relationship display screen between voices/images and a document;

FIG. 17 is a block diagram showing another example of an arrangement of the documentation browsing apparatus; and

FIG. 18 is an illustration showing an example of a documentation browsing robot having the documentation browsing apparatus.

DESCRIPTION OF SYMBOLS

10 audio/image inputting means

20 document inputting means

21 audio/image section extracting means

22 speech recognition means

23 candidate text/document associating means

30 audio/image storage means

31 frequency analysis unit

32 phoneme recognition unit

33 word/document recognition unit

34 phoneme template dictionary

35 dictionary/language model accumulating unit

40 document storage means

41 word in candidate text section extracting means

42 document section extracting means

43 word in document section extracting means

44 candidate text section/document section associating means

45 candidate text section/document section word similarity calculating means

50 relationship deriving means

51 presenting type selecting means

52 individual relationship presenting means

53 association presenting means

60 relationship presenting means

61 relationship analysis deriving means

62 relationship holding means

63 relationship correcting means

70 document editing means

80 associated audio/image playback means

81 inconformity detecting means

82 inconformity presenting means

90 outputting means

100 relationship updating instructing means

110 document editing state observing means

BEST MODE FOR CARRYING OUT THE INVENTION

EMBODIMENT 1

An embodiment of the present invention will be described with reference to the drawings. FIG. 1 is a block diagram showing an example of an arrangement of a documentation browsing apparatus according to the present invention. As shown in FIG. 1, the documentation browsing apparatus includes audio/image inputting means 10 for inputting voices or images, document inputting means 20 for inputting documents, audio/image storage means 30 for storing audio data or image data, document storage means 40 for storing document data, relationship deriving means 50 for deriving the relationship between voices or images and documents, relationship presenting means 60 for displaying and presenting the relationship between voices or images and documents, associated audio/image playback means 80 for playing back voices or images, document editing means 70 for editing document data, outputting means 90 for outputting documents, relationship updating instructing means 100 for instructing reevaluation of the relationship, and document editing state observing means 110 for monitoring the state of document editing.

In this embodiment, an example will be described in which a person who participated in a meeting creates the minutes of the meeting (proceedings) as a document by using the documentation browsing apparatus. The documentation browsing apparatus can be adapted not only to create the minutes of a meeting but also to create a document in general circumstances, including a lecture or a lesson.

The audio/image inputting means 10 is realized by an image inputting device such as a video camera, an audio inputting device such as a microphone, and an information processing unit such as a computer. The audio/image inputting means 10 generates image data by recording images of an event such as a meeting, which is the object of documentation (in this example, creating the proceedings), and generates audio data by recording voices in the meeting. For creating the minutes of a meeting according to this embodiment, an example will be described in which one or more video cameras for shooting the meeting scenes are set up and an audio recorder working in conjunction with a tape recorder or a microphone is set on a desk in the meeting room.

The document inputting means 20 is realized by a text inputting device such as a keyboard, a pen inputting device or a scanner, and an information processing unit such as a computer. The document inputting means 20 accepts the input of a document which describes a target event such as a meeting. For example, the document inputting means 20 accepts the input of a first draft of the minutes of the meeting (proceedings), created based on a memorandum made by other participants of the meeting, according to an operation of the person who creates the minutes. The document inputting means 20 generates document data based on the inputted document.

In this embodiment, a keyboard for creating the minutes of a meeting, or a scanner or an OCR (Optical Character Reader) for reading and recording minutes created on a sheet of paper, is set up in a meeting room as the document inputting means 20. As the document inputting means 20, an inputting device for reading a document created by another word processor may also be used.

The audio/image storage means 30 is a recording medium such as RAM (Random Access Memory), flash memory or a hard disk. The audio/image storage means 30 stores the audio data or image data generated by the audio/image inputting means 10. The document storage means 40 is a recording medium such as RAM, flash memory or a hard disk. The document storage means 40 stores the document data generated by the document inputting means 20. The document storage means 40 also stores document data edited and updated by the document editing means 70.

The relationship deriving means 50 compares the audio data or image data stored by the audio/image storage means 30 with the document data stored by the document storage means 40, and derives the relationship between a segment of the voices or images and a segment of the document. For example, the relationship deriving means 50 derives relationship indicating which segment included in the document is associated with a given segment of the voices or images.

The relationship presenting means 60 displays the relationship between the voices/images and the document derived by the relationship deriving means 50 on a display unit (not shown) and presents it to a person who creates a document or to a participant of a meeting. The associated audio/image playback means 80 plays back the voices or images associated with a particular place in the document according to a playback instruction from a participant who has checked the relationship presented by the relationship presenting means 60. The relationship deriving means 50, the relationship presenting means 60 and the associated audio/image playback means 80 are realized, for example, by an information processing unit such as a computer.

The document editing means 70 updates the document data stored by the document storage means 40 by creating, deleting or correcting the document. For example, a person who drafts the minutes of a meeting completes them by checking the relationship presented by the relationship presenting means 60 and operating the document editing means 70 to expand or correct the self-made minutes, or by collecting opinions from the respective participants who checked the relationship, operating the document editing means 70, and updating the document data stored by the document storage means 40 based on the collected opinions.

The document editing means 70 is an information processing terminal such as a personal computer. A document editing means 70 may be provided for each of the participants of a meeting. In such a case, each participant may correct the minutes by checking the relationship presented by the relationship presenting means 60, operating a terminal by himself/herself, and updating the document data stored by the document storage means 40.

The outputting means 90 outputs a document based on the document data stored in the document storage means 40. For example, the outputting means 90 is realized by an outputting device such as a printer and an information processing unit such as a computer, and outputs the minutes of a meeting based on the document data stored by the document storage means 40.

The relationship updating instructing means 100 causes the relationship between the voices or images and the document to be reevaluated when the document data is updated. A user may also directly give an instruction to the relationship updating instructing means 100. The document editing state observing means 110 monitors the document editing state or the like and detects whether the document data has been updated or not. The relationship updating instructing means 100 and the document editing state observing means 110 are realized by an information processing unit such as a computer.

Audio and image generating means is realized by the audio/image inputting means 10. Document generating means is realized by the document inputting means 20. Correspondence generating means and matching means are realized by the relationship deriving means 50. Association displaying means is realized by the relationship presenting means 60. Document updating means is realized by the document editing means 70. Playback means is realized by the associated audio/image playback means 80. Relationship recalculation instructing means is realized by the relationship updating instructing means 100. Document update determining means is realized by the document editing state observing means 110.

The arrangement of this embodiment can be realized by software. For example, a storage device (not shown) of an information processing unit which realizes the documentation browsing apparatus stores a documentation browsing program for causing a computer to execute the processes of: generating audio data or image data by recording the contents of dialogue; generating document data based on a document inputted by a user; generating correspondence between the voices or images included in the audio data or image data and the document included in the document data; displaying the voices or images included in the audio data or image data and the document included in the document data in association with each other based on the correspondence; and updating the document data associated with the displayed document.

Now, operation will be described. FIG. 2 is a flowchart showing an example of the minutes creating process, in which the minutes of a meeting are created according to operations by a person who creates them, and the minutes browsing process, in which a participant of the meeting or the like browses the created minutes. In this embodiment, one of the participants has taken on the role of the person who creates the minutes in advance. Alternatively, any one of the participants may create the minutes.

The audio/image inputting means 10 shoots the scenes of a meeting, records the sounds or voices, and generates image data and audio data (step S101). The audio/image inputting means 10 causes the audio/image storage means 30 to store the generated image data and audio data.

A person creates the minutes of the meeting based on a memorandum created by himself/herself during the meeting or by another participant. Then, the person who creates the minutes operates the document inputting means 20 and inputs the created minutes. The document inputting means 20 accepts the input of the minutes and generates document data according to the operation by that person (step S102). The document inputting means 20 causes the document storage means 40 to store the generated document data.

The relationship deriving means 50 extracts the audio data or image data from the audio/image storage means 30 and extracts the document data from the document storage means 40. The relationship deriving means 50 derives the relationship between a section of the voices or images and a section of the document based on the extracted audio data or image data and the extracted document data.

FIG. 3 is a block diagram showing an example of an arrangement of the relationship deriving means 50. As shown in FIG. 3, the relationship deriving means 50 includes audio/image section extracting means 21 for dividing voices or images into sections, speech recognition means 22 for generating a text line by performing speech recognition, and candidate text/document associating means 23 for associating the text line generated by the speech recognition means 22 with the document data. In this embodiment, a case of dividing audio into sections based on audio data and deriving its relationship with the document will be described.

The relationship deriving means 50 may divide sounds or voices into sections based on the audio data included in the audio track of the image data, instead of the audio data of the direct recording of the meeting, to establish an association by performing speech recognition. Hereinafter, the audio data of the direct recording of a meeting will be referred to simply as input audio data, and the audio data included in the audio track of image data will be referred to as track audio data.

The audio/image section extracting means 21 divides the voices or images into sections based on the audio data or image data stored by the audio/image storage means 30 (step S103). In this embodiment, each of the divided sections is called an audio section or an image section. In this example, the audio/image section extracting means 21 extracts audio sections based on the input audio data or the track audio data.

In order to derive relationship based on audio data, the audio/image section extracting means 21 extracts audio sections based on the audio data stored by the audio/image storage means 30. For example, the audio/image section extracting means 21 extracts audio sections by dividing the voices included in the audio data at each silent interval.
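
As a rough sketch of such silence-based division (the energy threshold, frame size and minimum silence length below are assumptions, not values from the embodiment):

    import numpy as np

    def split_at_silence(samples, rate, frame_ms=30, energy_thr=1e-4, min_silence_s=0.5):
        # Divide mono audio (float samples in [-1, 1]) into voiced sections,
        # closing a section once silence lasts at least min_silence_s.
        samples = np.asarray(samples, dtype=float)
        frame = int(rate * frame_ms / 1000)
        sections, start, silence = [], None, 0.0
        for pos in range(0, len(samples) - frame, frame):
            t = pos / rate
            voiced = np.mean(samples[pos:pos + frame] ** 2) > energy_thr
            if voiced:
                if start is None:
                    start = t
                silence = 0.0
            elif start is not None:
                silence += frame / rate
                if silence >= min_silence_s:
                    sections.append((start, t - silence + frame / rate))
                    start, silence = None, 0.0
        if start is not None:
            sections.append((start, len(samples) / rate))
        return sections  # (start, end) times of the audio sections, in seconds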

In order to derive relationship based on image data, the audio/image section extracting means 21 extracts image sections based on the image data stored by the audio/image storage means 30. For example, the audio/image section extracting means 21 extracts image sections by using scene switching point information.

The audio/image section extracting means 21 may also extract sections without regard for the contents of the voices or images, for example by simply extracting a section every five or ten minutes.

The speech recognition means 22 performs speech recognition based on the section information of the voices or images from the audio/image section extracting means 21 (in this example, audio section information) and the audio data stored by the audio/image storage means 30 (the input audio data) or the track audio data included in the image data, converts the voices of the meeting into text, and generates a text line. Hereinafter, a text line generated by the speech recognition means 22 will be referred to as a candidate text.

Many approaches to speech recognition have been proposed. FIG. 4 is a block diagram showing an example of an arrangement of the speech recognition means 22. As shown in FIG. 4, the speech recognition means 22 includes a frequency analysis unit 31 for analyzing the frequencies of audio data, a phoneme recognition unit 32 for converting audio data into phoneme data, a word/document recognition unit 33 for converting phoneme data into word data or document data, a phoneme template dictionary 34 for accumulating representative data of phonemes, and a dictionary/language model accumulating unit 35 for accumulating information on words and language models.

The frequency analysis unit 31 generates an acoustic characteristic quantity, which is quantitative data in the form of a multidimensional vector converted from frequency-analyzed audio data. For example, the frequency analysis unit 31 generates, as the acoustic characteristic quantity, quantitative data of the cepstrum or the normalized logarithmic power converted from the audio data, or of the first or second derivative of the cepstrum or the normalized logarithmic power.
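
For instance, a cepstrum vector for one audio frame might be computed as the discrete cosine transform of the log power spectrum (a simplified sketch of one common definition; the windowing and the number of coefficients kept are assumptions):

    import numpy as np
    from scipy.fft import dct

    def cepstrum(frame, n_coef=13):
        # Log power spectrum of a windowed frame, then a DCT: the low-order
        # coefficients form the acoustic characteristic vector.
        windowed = frame * np.hamming(len(frame))
        power = np.abs(np.fft.rfft(windowed)) ** 2
        return dct(np.log(power + 1e-10), norm='ortho')[:n_coef]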

The phoneme template dictionary 34 previously stores representative data, such as the cepstrum of each phoneme, as templates. Hereinafter, data stored by the phoneme template dictionary 34 as a template will be referred to as a phoneme template. The phoneme recognition unit 32 compares the acoustic characteristic quantity generated by the frequency analysis unit 31 with each phoneme template stored by the phoneme template dictionary 34 and converts the inputted voices into phoneme data. In doing so, the phoneme recognition unit 32 converts the inputted voices into phoneme data in the inputted order along the time line and outputs the phoneme data.
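
The comparison against templates can be pictured as a nearest-template decision per frame (a deliberately simplified sketch; a practical recognizer would use statistical models rather than single representative vectors):

    import numpy as np

    def recognize_phonemes(feature_frames, templates):
        # templates: mapping such as {'a': vector, 'i': vector, ...} holding one
        # representative cepstrum vector per phoneme (illustrative assumption).
        return [min(templates, key=lambda p: np.linalg.norm(f - templates[p]))
                for f in feature_frames]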

The dictionary/language model accumulating unit 35 previously stores information on the relationship between words and phonemes in each language and information on the appearance probability in each state in a document. The “appearance probability in each state” means an expectancy indicating the degree of probability that a word appears in a dialogue in a meeting on a certain agenda, or an expectancy that a phoneme appears after a specific word from a grammatical point of view. For example, if the probability that a postpositional particle or an auxiliary verb grammatically appears after a certain word is high, the dictionary/language model accumulating unit 35 stores information indicating a high appearance probability for the phonemes associated with that postpositional particle or auxiliary verb.

The word/document recognition unit 33 determines whether the word data stored by the dictionary/language model accumulating unit 35 includes data that matches a phoneme line included in the phoneme data generated by the phoneme recognition unit 32. The word/document recognition unit 33 converts a phoneme line included in the phoneme data into a word or a document based on the degree of matching with the word data stored in the dictionary/language model accumulating unit 35, the appearance probabilities in the current state, and the like. For example, when it is determined from the appearance probabilities that a postpositional particle or an auxiliary verb is likely, the word/document recognition unit 33 converts the phoneme sequence into such a word in preference to other words even if the voices are unclear. Likewise, if the word/document recognition unit 33 determines, based on an appearance probability or the like, that a participant's name was likely uttered, it converts the phoneme sequence into the character string of the participant's name in preference to other words.
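
One simple way to picture this combination (a sketch under the assumption that acoustic match scores and appearance probabilities are already available as precomputed numbers) is to add the log appearance probability to each candidate word's acoustic score:

    import math

    def best_word(candidates, appearance_prob):
        # candidates: list of (word, acoustic_log_score) pairs for one phoneme line;
        # appearance_prob: dict mapping a word to its probability in the current state.
        return max(candidates,
                   key=lambda c: c[1] + math.log(appearance_prob.get(c[0], 1e-6)))[0]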

The word/document recognition unit 33 generates text data including the converted words or the converted document as a candidate text (step S104).

Although an example of performing speech recognition by phoneme has been described, speech recognition may be performed with word-based templates instead of phoneme data. A template may also be adapted to a speaker, and any type of speech recognition may be used. For example, the phoneme template dictionary 34 may previously store a phoneme template of male voices and a phoneme template of female voices. In such a case, the speech recognition means 22 may identify whether a voice is male or female and perform speech recognition using the corresponding phoneme template. It may likewise use a phoneme template of child voices or of adult voices, or prepare a phoneme template for a particular person.

The candidate text/document associating means 23 associates the candidate text of each audio/image section generated by the audio/image section extracting means 21 with the document data extracted from the document storage means 40, and derives the relationship.

FIG. 5 is a block diagram showing an example of an arrangement of the candidate text/document associating means 23. As shown in FIG. 5, the candidate text/document associating means 23 includes word-within-candidate-text section extracting means 41 for extracting a predetermined word from a candidate text, document section extracting means 42 for dividing a document into sections of arbitrary length based on document data, word-within-document section extracting means 43 for extracting a predetermined word from document data, candidate text section/document section associating means 44 for associating voices/images with a document, and candidate text section/document section word similarity calculating means 45 for calculating the similarity of respective words. The basic word extracting means is realized by the means 41 and the means 43.

The means 41 extracts one or more words included in each audio/image section from the candidate text corresponding to the section.

The document section extracting means 42 divides a document included in the document data extracted from the document storage means 40 into sections of arbitrary length (step S105). In this embodiment, each of the divided sections is called a document section. For example, the document section extracting means 42 extracts a document section by using line feed information or null line information in a document. The document section extracting means 42 may also extract a document section by using structural information in a document, such as a chapter heading.

The document section extracting means 42 may extract a document section by using the analyzed result of words used in a document. For example, if a document is created for news and the words in the document change from words relating to sports, such as “baseball, homerun, complete game”, to words relating to politics and economics, such as “business, Prime Minister, tax cut”, the document section extracting means 42 determines that the topic has changed. In such a case, the document section extracting means 42 performs, for example, vector conversion on the appearance frequency of words extracted by word frequency analysis in each line of a document, detects a point where the vector value changes significantly as a topic converting point, and extracts the span divided at topic converting points as a document section.
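
A minimal sketch of this topic-change detection is given below: it builds a word-frequency vector for a window of lines before and after each candidate point and declares a topic converting point where the cosine similarity between the two vectors drops. The window size, the threshold and the whitespace tokenization are illustrative assumptions.

    from collections import Counter
    import math

    def cosine(a, b):
        """Cosine similarity between two word-frequency Counters."""
        num = sum(a[w] * b[w] for w in a if w in b)
        den = math.sqrt(sum(v * v for v in a.values())) * \
              math.sqrt(sum(v * v for v in b.values()))
        return num / den if den else 0.0

    def split_at_topic_changes(lines, window=3, threshold=0.2):
        """Split a document where the word-frequency vector changes sharply."""
        bounds = [0]
        for i in range(window, len(lines) - window):
            before = Counter(w for line in lines[i - window:i] for w in line.split())
            after = Counter(w for line in lines[i:i + window] for w in line.split())
            if cosine(before, after) < threshold:  # topic converting point
                bounds.append(i)
        bounds.append(len(lines))
        return [lines[s:e] for s, e in zip(bounds, bounds[1:]) if s < e]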

The word-within-document section extracting means 43 extracts one or more words included in the document for each document section. The candidate text section/document section associating means 44 compares the set of words extracted by the means 41 with the set of words extracted by the means 43, associates each audio/image section with a document section, and outputs the result. That is to say, the words extracted by the means 41 and the means 43 play the role of basic words, a basic unit for matching a candidate text and a document. The means 44 matches the candidate text and the document by comparing the respective sets of words and associates each audio/image section with a document section.

The candidate text section/document section word similarity calculating means 45 calculates, for each pair of an audio/image section and a document section, the similarity of the words included in the two sections, which serves as a distance between the sections.

The means 44 associates each audio/image section with a document section based on the similarity calculated by the means 45 and derives the relationship between voices/images and a document (step S106).
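
As one hedged reading of step S106, the sketch below scores each audio/image section against every document section by the overlap (Jaccard similarity) of their basic word sets and keeps the best match above a threshold. The data shapes and the threshold value are assumptions of this illustration.

    def jaccard(words_a, words_b):
        """Similarity (redundant degree) of two basic-word sets."""
        a, b = set(words_a), set(words_b)
        return len(a & b) / len(a | b) if a | b else 0.0

    def associate_sections(audio_sections, doc_sections, threshold=0.3):
        """audio_sections: list of (time span, basic words); doc_sections:
        list of (document text, basic words). Returns one (span, document
        index or None) pair per audio/image section."""
        table = []
        for span, audio_words in audio_sections:
            scored = [(jaccard(audio_words, doc_words), idx)
                      for idx, (_, doc_words) in enumerate(doc_sections)]
            best_score, best_idx = max(scored)
            # Below the threshold, "association cannot be established".
            table.append((span, best_idx if best_score >= threshold else None))
        return table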

FIG. 6 is an illustration schematically showing the operation of the process of associating sections performed by the candidate text/document associating means 23. In the example shown in FIG. 6, four document sections 601, 602, 603, 604 are extracted by the document section extracting means 42 with reference to line feed/null line information. The word-within-document section extracting means 43 extracts (information communication, speech recognition, semantic information, . . . ), (security, video camera, animal body, . . . ), (experiment, . . . ), (study, . . . ) as important words in the document sections 601, 602, 603, 604, respectively.

On the other hand, the audio/image section extracting means 21 sets audio/image sections by audio/image analysis. In the example shown in FIG. 6, the audio/image section extracting means 21 extracts (13:41, 15:41), (15:41, 16:50), (16:50, 20:15), (20:15, 21:13) . . . as audio sections 605, 606, 607, 608 by using silent sections. In the audio sections 605, 606, 607, 608, the means 41 extracts words such as (speech recognition, semantic information, . . . ), (information communication, semantic information, . . . ), (security, . . . ), (study, . . . ), respectively.

The basic words extracted by the means 41 and the means 43 may simply be nouns extracted from a candidate text or a document. With a dictionary database in which important words are registered, the means 41 and the means 43 may extract words that match entries in the dictionary. The means 41 and the means 43 may also determine the level of importance of a word by analyzing the frequency of word usage.

Based on the similarity (redundant degree) of word lines, the candidate text section/document section associating means 44 may derive the relationship between respective sections. For example, if the similarity calculated by the means 45 exceeds a predetermined threshold, the means 44 may determine that the voices/images match a document and derive the relationship. As shown in FIG. 6, for a section which cannot be associated, the means 44 may conclude that “association cannot be established”.

In the example shown in FIG. 6, sections are associated by focusing on the redundant degree of the words included in each section, but the association may also be performed with the appearance frequency of all the words. For example, instead of extracting important words, the relationship deriving means 50 may generate a histogram of all the words included in each section, evaluate the appearance frequency of the entire words based on the similarity of the histograms, and decide the associated sections.
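
The histogram variant might look like the following sketch, which compares full word histograms by cosine similarity rather than comparing sets of extracted important words; the whitespace tokenization is again an assumption for illustration.

    from collections import Counter
    import math

    def histogram_similarity(text_a, text_b):
        """Cosine similarity of the histograms of all words in two sections."""
        ha, hb = Counter(text_a.split()), Counter(text_b.split())
        num = sum(ha[w] * hb[w] for w in ha if w in hb)
        den = math.sqrt(sum(v * v for v in ha.values())) * \
              math.sqrt(sum(v * v for v in hb.values()))
        return num / den if den else 0.0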

The means 44 may perform the association efficiently by reusing partial calculation results based on the dynamic programming matching system (DP matching system). In such a case, the means 44 establishes correspondence by extracting relationships in time order and accumulating them so that the total match is maximized. Deriving the audio/image sections and deriving the document sections are essentially equivalent even if they are performed by a method other than the methods shown in the embodiment.
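
A compact sketch of such DP matching is shown below: it fills a score table over the two ordered sequences of sections so that the accumulated similarity is maximal, then traces back the aligned pairs. The similarity function is passed in (for example, one of the functions sketched above applied to the section representations); the skip moves are an assumption of this sketch.

    def dp_match(audio_sections, doc_sections, sim):
        """Align two ordered section sequences, maximizing total similarity."""
        n, m = len(audio_sections), len(doc_sections)
        score = [[0.0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                match = score[i - 1][j - 1] + sim(audio_sections[i - 1],
                                                  doc_sections[j - 1])
                # Either match the two sections or skip one of them.
                score[i][j] = max(match, score[i - 1][j], score[i][j - 1])
        pairs, i, j = [], n, m
        while i > 0 and j > 0:  # trace back the accumulated relationship
            match = score[i - 1][j - 1] + sim(audio_sections[i - 1],
                                              doc_sections[j - 1])
            if score[i][j] == match:
                pairs.append((i - 1, j - 1))
                i, j = i - 1, j - 1
            elif score[i][j] == score[i - 1][j]:
                i -= 1
            else:
                j -= 1
        return list(reversed(pairs))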

The width (extraction unit) of the document sections or the audio/image sections to be extracted may be a word, several words, a sentence, several sentences or a chapter, insofar as it is a collective unit. As the optimal unit of section varies depending on the accuracy of speech recognition or on how a document is summarized, a user or a system administrator may set the unit of section according to the situation.

FIG. 7 is an illustration showing an example where a candidate text and a document are associated with each other by speech recognition on a word level. In the example shown in FIG. 7, the relationship is derived based on a word appearing in a candidate text for voices or images and a word appearing in a document. Even if a candidate text produced by speech recognition includes an error, the relationship deriving means 50 can correctly derive the relationship based on a correctly recognized word. Such a method of deriving relationship on a word level is appropriate for a document which must describe remarks in a meeting accurately, such as the minutes of statements at the Diet.

FIG. 8 is an illustration showing an example of associating a candidate text with a document on the level of word lines (word groups) extracted from each section. In the example shown in FIG. 8, the relationship deriving means 50 measures the similarity (redundant degree of words or the like) between the word lines extracted from each audio/image section and each document section, and generates a relationship for each section. In the example shown in FIG. 8, the relationship deriving means 50 determines that the word lines (security, animal body extraction, case) and (security, animal body extraction) are associated with each other, and that the word lines (video camera, adjustment) and (video camera, correction) are associated with each other. As no document section associated with the audio section (tracking) is present, the relationship deriving means 50 deems the relationship “not associated” for the audio section (tracking).

Such a method of deriving relationship on the level of word lines is appropriate for a case where it is desirable to derive the relationship between a summarized sentence and voices and present it to a user, instead of correctly reflecting the remarks in the actual event. If documents on objective events are described in order of occurrence (the time line information is preserved) and the DP matching system is used for the association between a document and voices, an appropriate relationship can be derived even when isolated expressions are present.

In the above-mentioned example, a candidate text is extracted from recorded or inputted audio data by using speech recognition on the audio data, but an image may also be converted into text through image recognition. For example, if image data includes an image of an automobile, “automobile” can be visually recognized and converted into text so as to match a basic word in a document.

When the relationship deriving means 50 derives the relationship based on audio data or image data stored in the audio/image storage means 30 and document data stored in the document storage means 40, it generates correspondence table data (a correspondence) in which voices or images and a document correspond with each other. Although the correspondence table data is automatically generated by the means 50 here, it may instead be generated according to the proceedings drafted by a timekeeper, for example. The correspondence table data may also be generated manually with an image editing device, or generated interactively.

FIG. 9 is an illustration showing an example of the relationship derived by the means 50 and correspondence table data for presenting the relationship. As shown in FIG. 9, the means 50 generates correspondence table data 901 based on the derived relationship between voices/images and a document.

As shown in FIG. 9, when the relationship between voices/images and a document is derived, the voices/images may lack sections corresponding to document sections. For example, as shown in FIG. 9, no audio/image section associated with a document section 904 may exist. Conversely, there may be no document section corresponding to an audio/image section. For example, as shown in FIG. 9, an audio/image section 906 (2:48-3:13) which does not have any associated document section may be present. The relationship deriving means 50 generates the correspondence table data 901 in the form of “images or voices (starting time and ending time of a section): corresponding document”, for example.
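
In code, correspondence table data in that form could be represented as below. The field names and the sample document texts are illustrative assumptions (only the span 2:48-3:13 is taken from FIG. 9), with None marking a section for which association cannot be established.

    # One entry per audio/image section: its starting and ending time and
    # the corresponding document section (None when nothing is associated).
    correspondence_table = [
        {"start": "0:00", "end": "2:48", "document": "Opening remarks ..."},
        {"start": "2:48", "end": "3:13", "document": None},  # like section 906
        {"start": "3:13", "end": "5:40", "document": "Agenda item 1 ..."},
    ]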

The relationship deriving means 50 may be adapted to correct the correspondence table data according to an operation by a user such as a person who creates the minutes of a meeting or a participant of a meeting (hereinafter simply referred to as a user). FIG. 10 is a block diagram showing an example of an arrangement of the relationship deriving means 50 correcting correspondence table data according to a user's operation. As shown in FIG. 10, the relationship deriving means 50 includes relationship analysis deriving means 61 for deriving the relationship between voices/images and a document, relationship holding means 62 for storing correspondence table data, and relationship correcting means 63 for correcting correspondence table data.

The relationship analysis deriving means 61 derives the relationship between voices/images and a document by analyzing audio data or image data and document data, and generates correspondence table data. The relationship analysis deriving means 61 includes, for example, the audio/image section extracting means 21, the speech recognition means 22, and the candidate text/document associating means 23. The relationship holding means 62 is a storage medium such as a RAM, a flash memory or a hard disk, and stores the correspondence table data generated by the relationship analysis deriving means 61.

The relationship correcting means 63 corrects the correspondence table data stored in the relationship holding means 62 according to a user's operation. For example, when a user wants to correct the correspondence table data, the user operates the documentation browsing apparatus and inputs correction instructions for the correspondence table data. Then, the documentation browsing apparatus displays an edit screen for the correspondence table data. The relationship correcting means 63 corrects the correspondence table data according to the user's operation.

The relationship presenting means 60 displays information on the result of the relationship between voices/images and a document based on the correspondence table data generated by the relationship deriving means 50 (step S107). That is to say, the relationship between voices/images and a document is presented to the user.

For example, when the relationship between voices/images and a document is derived, a person who creates the minutes of a meeting checks/corrects the contents of the proceedings based on the relationship. The person who creates the minutes of a meeting may inform each of the participants by E-mail that the minutes of the meeting have been completed. When a participant wants to check the contents of the proceedings, the participant operates the documentation browsing apparatus and inputs a request to display the minutes of the meeting. Then, the relationship presenting means 60 displays information on the relationship between the voices/images of the meeting and the document of the minutes of the meeting.

The documentation browsing apparatus may be connected to a terminal of each of the participants via a communication network such as a LAN. The terminal of each of the participants may include the relationship presenting means 60. In such a case, each of the participants may operate his/her terminal to check information on the relationship or correct the correspondence table data via the LAN or the like.

FIG. 11 is an illustration showing an example of a relationship display screen showing the relationship between voices/images and a document displayed by the relationship presenting means 60. In this example, a case where the relationship display screen is displayed based on the correspondence table data shown in FIG. 9 will be described. As shown in FIG. 11, the relationship display screen includes a time axis bar 1101 whose ordinate indicates time information showing the elapsed time of an image and audio. As shown in FIG. 11, the relationship display screen represents the starting time/ending time/duration of a section by displaying each part of the time axis bar 1101 in a different color for each of the audio/image sections 1102, 1103 and 1104. For example, in FIG. 11, a section 1103 of “13:16-14:18” is displayed in a single color on the corresponding part of the time axis bar 1101, and the color of the section 1103 differs from those of the adjacent sections 1102 and 1104.

The characters of the document corresponding to each of the sections 1102, 1103 and 1104 are displayed in the same distinguishing colors as on the time axis bar 1101. The characters are displayed in document displaying sections 1105, 1106 and 1107 in association with the audio/image sections 1102, 1103 and 1104, respectively.

Each of the document displaying sections 1105, 1106 and 1107 includes a playback button 1108 for playing back voices or images. The relationship display screen includes an image screen 1109 for playing back and displaying an image. When a user clicks on the playback button 1108, the voices of the associated section are played back and the image is played back and displayed on the image screen 1109.

Each of the document displaying sections 1105, 1106 and 1107 also includes an editing button 1110 for editing the displayed document. When a user clicks on the editing button 1110, an editing screen displaying the contents of the document of the associated section is displayed. Then, the user can create, add to, adjust or delete the document.

As shown in FIG. 11, the relationship display screen includes an image slide bar 1111 under the image screen 1109 for scrolling an image. The relationship display screen also includes a character slide bar 1112 for scrolling the character information displayed in the document displaying sections 1105, 1106 and 1107. For example, the user can maneuver the image slide bar 1111 and the character slide bar 1112 together to freely play back an image or display a document in a particular section.

FIG. 12 is an illustration showing other examples of a method for displaying the time axis for displaying a document and an associated image on the relationship display screen. In this example, a case where the time axis is displayed based on the correspondence table data shown in FIG. 12 (a) will be described. FIG. 12 (b) shows a case where the length of a section on the time axis bar 1101 for each image (audio)/document section is set in proportion to the image (audio) time. That is to say, in FIG. 12 (b), the length for displaying a section is decided in proportion to the length of the image (audio) time. By displaying lengths in proportion to the time information, the user can intuitively recognize and understand the time spent for each item in a meeting.

FIG. 12 (c) shows a case where the length of a section on the time axis bar 1101 is fixed to a predetermined equal length for every section. With the length fixed, the user can intuitively recognize the number of image (audio)/document sections or the number of all sections displayed on the screen. FIG. 12 (d) is an example where the length of a section on the time axis bar 1101 is in proportion to the amount of document created. By making the length proportional to the amount of document, the embodiment can increase the display density of the document, and the user can efficiently browse or create a document.
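
The three display types of FIG. 12 reduce to a simple length computation, sketched below. The pixel budget and field names are assumptions of this illustration, and the amount of document is approximated by character count.

    def section_bar_lengths(sections, mode, total_px=400):
        """Lengths of sections on the time axis bar for the three types.

        mode: "time"  -> proportional to image (audio) time (FIG. 12 (b))
              "fixed" -> predetermined equal length (FIG. 12 (c))
              "text"  -> proportional to the amount of document (FIG. 12 (d))
        sections: list of dicts with "duration" (seconds) and "text" fields.
        """
        if mode == "time":
            total = sum(s["duration"] for s in sections)
            return [total_px * s["duration"] / total for s in sections]
        if mode == "fixed":
            return [total_px / len(sections)] * len(sections)
        total = sum(len(s["text"]) for s in sections)
        return [total_px * len(s["text"]) / total for s in sections]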

If a section is too small to display its document at a time, the section may be displayed in a box including a scroll bar. The embodiment enables a user to browse the entire document by allowing the user to scroll the scroll bar of each section with a mouse. The method for displaying the time axis is not limited to the ones described in the embodiment. The present invention may be adapted to enable a user to decide, as needed, whether or not to display the time axis.

FIG. 13 is a block diagram showing an example of an arrangement of the relationship presenting means 60 that allows a user to select a method for displaying the time axis. As shown in FIG. 13, the relationship presenting means 60 includes presenting type selecting means 51 for selecting a type of time axis display (presentation) and individual relationship presenting means 52 for presenting the relationship between voices/images and a document by displaying it. The means 51 decides a display size for each image/document section according to a selection instruction by a user. Then, the means 52 displays the relationship according to the display size decided by the means 51. The displaying type selecting means is realized by the presenting type selecting means 51.

The relationship presenting means 60 may be adapted to detect, as inconformity, that no associated section is present: namely, when the correspondence table data includes an audio/image section with no associated document section, or when the correspondence table data includes a document section with no associated audio/image section.

FIG. 14 is a block diagram showing an example of an arrangement of the relationship presenting means 60 when it performs inconformity detection. As shown in FIG. 14, the relationship presenting means 60 includes inconformity detecting means 81 for detecting inconformity of sections, inconformity presenting means 82 for displaying and presenting the inconformity, and association presenting means 53 for displaying and presenting the relationship between voices/images and a document.

When the correspondence table data includes an audio/image section with no associated document section, or when the correspondence table data includes a document section with no associated audio/image section, the inconformity detecting means 81 detects inconformity. When the inconformity detecting means 81 detects inconformity, the inconformity presenting means 82 displays that the inconformity is detected. The association presenting means 53 displays the relationship between voices/images and a document based on the correspondence table data. The association presenting means 53 includes, for example, the presenting type selecting means 51 and the individual relationship presenting means 52. The inconformity displaying means is realized by the inconformity presenting means 82.
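
A hedged sketch of the inconformity check over the correspondence_table structure assumed earlier might look like this; the entry shape is the same illustrative assumption as above.

    def detect_inconformity(correspondence_table, document_sections):
        """Return audio/image sections with no associated document section
        and document sections with no associated audio/image section."""
        unmatched_audio = [(e["start"], e["end"])
                           for e in correspondence_table if e["document"] is None]
        associated = {e["document"] for e in correspondence_table
                      if e["document"] is not None}
        unmatched_docs = [d for d in document_sections if d not in associated]
        return unmatched_audio, unmatched_docs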

The relationship presenting means 60 may be adapted to display a relationship display screen including information on the speech recognition result obtained at the relationship derivation. FIG. 15 is an illustration showing another example of a relationship display screen displayed by the relationship presenting means 60. As shown in FIG. 15, the relationship display screen displays information on the speech recognition result obtained at the relationship derivation together with the contents of the corresponding documents.

The embodiment enables a user to easily grasp the contents of a meeting even for a part of the meeting that has no associated document description. For example, in FIG. 15, even though no associated document description is present for the part 1501 “therefore,” in the speech recognition result, the user can still grasp the contents of the meeting by checking that part of the speech recognition result. By displaying the speech recognition result, the user can also easily recognize how the document was made from the actual dialog.

The relationship presenting means 60 may be adapted to display a relationship display screen including basic words extracted from the speech recognition result. FIG. 16 is an illustration showing yet another example of a relationship display screen displayed by the relationship presenting means 60. As shown in FIG. 16, the relationship display screen displays the extracted basic words (in the embodiment, “minutes of a meeting creating apparatus” and “relationship”) in association with the contents of the associated document. With the basic words displayed or presented, the user can immediately recognize the contents described in each section. That is to say, a basic word plays the role of a keyword for the user to recognize the described contents.

When the user checks the relationship display screen and determines that the displayed relationship is wrong, the user operates the documentation browsing apparatus and inputs an adjustment instruction for the relationship. The relationship presenting means 60 determines whether the user has inputted an adjustment instruction or not (step S108). When it is determined that an adjustment instruction has been inputted, the relationship presenting means 60 adjusts and updates the contents of the correspondence table data according to the operation of the user (step S109). Then, the relationship presenting means 60 displays a relationship display screen based on the updated correspondence table data (step S107).

When the user checks the relationship display screen and wants to play back voices or images associated with a document, the user operates the documentation browsing apparatus and inputs an instruction to play back the voices/images. For example, the user inputs a playback instruction by clicking on the playback button 1108 on the relationship display screen.

If it is determined that an adjustment instruction has not been inputted at step S108, the associated audio/image playback means 80 determines whether the user has inputted an instruction to play back voices/images (step S110). If it is determined that a playback instruction has been inputted, the associated audio/image playback means 80 plays back and displays the voices/images associated with the section for which the playback instruction was issued (step S111).

When the user wants to edit the contents of a displayed document, the user operates the documentation browsing apparatus and inputs an instruction to edit the document. For example, the user inputs an editing instruction by clicking on the editing button 1110 on the relationship display screen.

If it is determined that a playback instruction has not been inputted at step S110, the document editing means 70 determines whether the user has inputted an editing instruction (step S112). If it is determined that an editing instruction has been inputted, the document editing means 70 performs creation, addition, editing or deletion of a document according to the operation of the user (step S113).

When the document editing means 70 edits a document, it extracts from the document storage means 40 only the data corresponding to the section which is specified through the relationship presenting means 60 by the operation of the user among the target document data. For example, when a user clicks on the editing button 1110 of the place the user wants to edit on the relationship display screen, the document editing means 70 extracts from the document storage means 40 only the data corresponding to the document section including the editing button 1110 on which the user clicked.

The document editing means 70 updates the document data by editing the document included in the extracted data. For example, the user operates the documentation browsing apparatus and edits the document data of a drafted minutes of a meeting stored in the document storage means 40 to complete the minutes of the meeting.

The document editing means 70 may be adapted not only to edit by extracting data including only the part of the document to be edited from the document data, but also to adjust the entire document by extracting the entire document data. Each of the participants of a meeting may be allowed to operate a terminal of his/her own to edit the correspondence table data or the document data and play back voices/images via a LAN.

Although adjustment of the relationship (step S109), playback of voices or images (step S111), and editing of a document (step S113) are performed in sequence in the embodiment, the whole process may be performed in parallel.

If the user edits a document and wants to have the relationship between voices/images and the document recalculated, the user operates the documentation browsing apparatus and inputs an instruction to recalculate the relationship. If it is determined that an editing instruction has not been inputted at step S112, the relationship updating indicating means 100 determines whether the user has inputted a recalculation instruction or not (step S114).

If it is determined that a recalculation instruction has been inputted, the relationship updating indicating means 100 causes the relationship deriving means 50 to recalculate the derivation of the relationship between voices/images and the document. Then, the documentation browsing apparatus repeatedly performs the process from step S106 to step S114.

To recalculate the relationship after editing the document, the relationship deriving means 50 may derive the relationship anew and the relationship presenting means 60 may update the relationship display screen according to an instruction inputted by the user. Alternatively, the documentation browsing apparatus may automatically detect that the document has been edited, whereupon the relationship deriving means 50 derives the relationship and the relationship presenting means 60 updates the relationship display screen. When a user inputs an instruction or the completion of editing of a document is detected, the relationship updating indicating means 100 causes the relationship deriving means 50 to recalculate the relationship. That is to say, the user can give an instruction to derive the relationship via the relationship updating indicating means 100.

The document editing state observing means 110 observes the editing state of a document. When the contents of the document are updated, the means 110 may cause the means 100 to perform the relationship updating process. Then, the user can have the relationship derived after the document is edited without issuing an instruction.
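
As a hedged sketch of the means 110, the observer below compares the current document contents with the last seen contents and triggers the relationship updating process when they differ; the polling style and all names are assumptions for illustration.

    class DocumentEditingObserver:
        """Observe the editing state of a document and trigger the
        relationship updating process when the contents change."""

        def __init__(self, recalculate_relationship):
            self._last_contents = None
            self._recalculate = recalculate_relationship

        def check(self, document_contents):
            if self._last_contents is not None and \
               document_contents != self._last_contents:
                # The document was edited: recalculate without a user instruction.
                self._recalculate()
            self._last_contents = document_contents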

If it is determined that a recalculation instruction has not been inputted at step S114, the outputting means 90 determines whether the editing process of the document or the like is completed or not (step S115). If it is determined that the editing process is completed, the outputting means 90 outputs a document based on the document data accumulated in the document storage means 40 (step S116). For example, the outputting means 90 prints the minutes of a meeting based on the document data of the completed minutes of the meeting accumulated in the document storage means 40. If it is determined that the editing process is not completed at step S115, the documentation browsing apparatus repeatedly performs the process from step S107. The outputting means 90 may also output a document according to the operation of a user.

As mentioned above, according to the embodiment, the relationship between voices or images and a document based on those voices or images is derived, and a relationship display screen in which the voices or images and the document are associated with each other is displayed and presented to the user. The user can edit a document by checking the relationship display screen and playing back only the voices or images of the particular places needed to create the document. That makes the user's documentation more effective.

When a plurality of users create a document, each user can play back and browse the grounds for creating the document, that is, the voices or images associated with a particular place in the document. That enables all the users to immediately grasp the contents of the document, so that documentation by a plurality of users is made effective. Therefore, a document can be effectively created based on image data or audio data recorded at a meeting or a lecture. The person who creates the document or a participant of the meeting can browse a summarized document together with the voices or images, so that documentation by a plurality of persons is made effective.

According to the embodiment, when a user creates the minutes of a meeting, the user can select a particular section of the document included in the minutes of the meeting to locate the start point of the voices or images of the corresponding part. As the embodiment can display a place in the recorded voices or the recorded images of a meeting which has not been used for creating the minutes of the meeting and present it to the person who creates the minutes, that person can effectively browse the voices or images of a segment which is not described in the minutes of the meeting and edit the minutes by effectively adjusting or updating them.

As the embodiment can display a place in the document which has no associated voices or images and present it to the person who creates the minutes of the meeting, that person can easily edit the minutes of the meeting by adjusting or updating them. That can reduce the cost of creating the minutes of the meeting. With this embodiment, the participants of the meeting can also easily check the contents of the minutes of the meeting.

Even if a lecture hall has poor conditions for collecting the voices of a lecture, the embodiment can compensate for the accuracy of speech recognition by matching the words extracted by speech recognition against the text or the document. Therefore, a user neither needs sophisticated speech recognition means nor needs to set up a voice inputting device or analysis device, such as a sophisticated microphone, for each of the participants of the meeting to perform speech recognition. Therefore, the embodiment can reduce the size of the entire system for documentation and browsing, and also reduce the equipment cost.

EMBODIMENT 2

Now, the second embodiment of the present invention will be described with reference to the drawings. In this embodiment, a case where the documentation browsing apparatus is realized by a software program will be particularly described. FIG. 17 is a block diagram showing another example of an arrangement of a documentation browsing apparatus. As shown in FIG. 17, the documentation browsing apparatus includes audio/image inputting means 10, document inputting means 20, a program memory 71 for storing various programs for performing the process of documentation or browsing a document, a microprocessor 72 for performing the process according to a program stored in the program memory 71, an information memory 73 for storing various types of data, displaying means 74 for displaying a document or the like, instruction inputting means 76 for a user to input various instructions, and audio/image playback means 75 for playing back voices or images.

The audio/image storage means 30 and the document storage means 40 of the first embodiment are realized by the information memory 73. The relationship deriving means 50, the relationship presenting means 60, the document editing means 70 and the associated audio/image playback means 80 are realized by the microprocessor 72 performing the process according to a program stored in the program memory 71, together with the displaying means 74, the audio/image playback means 75, the instruction inputting means 76 and the like. The arrangement and functions of the audio/image inputting means 10 and the document inputting means 20 are the same as those of the first embodiment.

The microprocessor 72 extracts audio/image data and document data from the information memory 73 and derives the relationship between voices/images and a document according to a relationship deriving program 711 stored in the program memory 71. The microprocessor 72 stores the generated correspondence table data in the information memory 73. The microprocessor 72 also extracts the correspondence table data from the information memory 73 and causes the displaying means 74 to display the relationship display screen between voices/images and a document according to a relationship presenting program 712 stored in the program memory 71.

When a user inputs a playback instruction for the voices/images of a particular image from the instruction inputting means 76, the microprocessor 72 causes the audio/image playback means 75 to play back the voices/images according to an audio/image playback program 713 stored in the program memory 71. The microprocessor 72 creates, deletes, updates or adjusts document data stored in the information memory 73 according to a document editing program 714 stored in the program memory 71. The way of deriving the basic relationship, the presenting type, the way of playing back voices/images and the method for editing document data are the same as those of the first embodiment.

EMBODIMENT 3

Now, the third embodiment of the present invention will be described with reference to the drawings. In this embodiment, a case where documentation is performed by using a robot mounted with a documentation browsing apparatus will be described. FIG. 18 is an illustration showing an example of a documentation browsing robot having the documentation browsing apparatus.

As shown in FIG. 18, a robot 91 includes a camera 10 a and a microphone 10 b as the audio/image inputting means 10, and document inputting means 20 such as a scanner. The camera 10 a is image inputting means such as a video camera. The microphone 10 b is audio inputting means. As shown in FIG. 18, the robot 91 also includes the relationship presenting means 60 such as a display device.

The robot 91 is present at a meeting with the other participants in the meeting room, shoots the scenes of the meeting with the camera 10 a, and records the contents of the meeting with the microphone 10 b. For example, the robot 91 shoots the scenes of the meeting while changing its direction as it automatically rotates toward a speaker.

The person who creates the minutes of the meeting operates the document inputting means 20 and inputs document data such as a drafted minutes of a meeting into the robot 91. The robot 91 has the audio/image storage means 30, the document storage means 40, the relationship deriving means 50 and the like. The robot 91 derives the relationship between voices/images and a document by using the functions of the relationship deriving means 50 and displays a relationship display screen on the relationship presenting means 60 based on the generated correspondence table data.

When a user browses the presented screen and operates an operation unit (not shown) of the robot 91 to input a playback instruction for a part of which the user wants to play back the voices or images, the robot 91 plays back the voices or images of the associated section. With the robot 91 including the above-mentioned functions, the embodiment makes it possible to generate or browse the minutes of a meeting simply by letting the robot 91 be present in each meeting room.

Although a case where the minutes of a meeting are generated is described in the embodiment, the documentation browsing robot can be adapted to usage other than creating the minutes of a meeting. For example, the documentation browsing robot can be adapted to any usage in which a document is created or browsed using voices or images as they are, or by summarizing the voices or images, such as creating the minutes of a lecture or a lesson, or generating a diary or observation records.

INDUSTRIAL APPLICABILITY

In the documentation browsing method according to the present invention, the documentation browsing apparatus derives the relationship between voices/images and documents based on image data of the shot scenes of a meeting or audio data of the contents of the meeting, and document data including a drafted proceeding created by a person who creates the minutes of the meeting. The documentation browsing apparatus displays and presents the voices/images and the document, associated with each other, to each of the participants of the meeting or the person who creates the minutes. Therefore, the participants or the person who creates the minutes can play back only the voices or images of the necessary part, check them, and edit and complete the minutes of the meeting. As the documentation browsing apparatus can easily present the voices/images and the drafted proceeding of a meeting, associated with each other, to all the participants of the meeting, it can make the operation by which a plurality of participants create the minutes of a meeting more effective.

1. (canceled)
 2. (canceled)
 3. (canceled)
 4. (canceled)
 5. Documentation browsing apparatus characterized by comprising: correspondence generating means for generating correspondence between voices or images included in audio data or image data and a document included in document data; association displaying means for displaying the voices or the images included in said audio data or said image data and a document included in said document data associated with each other based on said correspondence; document updating means for updating said document data associated with a document displayed by said displaying means; and matching means for establishing a matching between a document and sounds or images; wherein said correspondence generating means comprises: means for generating association information based on said matching established by said matching means; and relationship recalculation instruction means for outputting recalculation instruction information for instructing recalculation of the relationship between a document and voices or images and letting the matching means recalculate said relationship.
 6. The documentation browsing apparatus according to claim 5, wherein the correspondence generating means divides voices or images included in audio data or image data and a document included in document data into predetermined sections and generates correspondence between voices or images and a document for said each section; and wherein the document updating means generates, adds, corrects or deletes a document for said each section.
 7. The documentation browsing apparatus according to claim 5, comprising playback means for playing back voices or images; wherein the correspondence generating means divides voices or images included in audio data or image data and a document included in document data into predetermined sections and generates correspondence between voices or images and a document for said each section; and wherein the document updating means generates, adds, corrects or deletes a document for said each section; and wherein said playback means plays back voices or images of a part associated with at least one section included in said document data among the voices or images included in said audio data or said image data.
 8. The documentation browsing apparatus according to any one of claims 5 to 7, wherein the document data includes a document of the minutes of a meeting, a lecture or a lesson, and the audio data or the image data includes voices or images of the recorded contents of the meeting, the lecture or the lesson.
 9. The documentation browsing apparatus according to any one of claims 5 to 8, wherein the correspondence generating means extracts a starting time and an ending time of a section of voices or images and generates association information which associates said starting time and said ending time with document data.
 10. The documentation browsing apparatus according to any one of claims 5 to 9, wherein the association displaying means displays time information which indicates the elapsed time of voices or images associated with each section of document data.
 11. The documentation browsing apparatus according to any one of claims 5 to 10, wherein the association displaying means displays a displaying location of a document included in document data on a display screen in association with time information which indicates an elapsed time of voices or images.
 12. The documentation browsing apparatus according to claim 11, wherein the association displaying means displays the length of displaying each section of document data on a display screen by a length in proportion to the playback time of the sounds or the images associated with said each section.
 13. The documentation browsing apparatus according to claim 11, wherein the association displaying means displays the length of displaying each section of document data on a display screen by a predetermined length.
 14. The documentation browsing apparatus according to claim 11, wherein the association displaying means displays the length of displaying each section of document data on a display screen by a length in proportion to the amount of documents in documents associated with said each section.
 15. The documentation browsing apparatus according to claim 11, comprising display type selection means for selecting a type of displaying the length of displaying each section of document data on a display screen; wherein said display type selection means selects a display type of a length in proportion to the playback time of the voices or the images associated with said each section, or a predetermined length, or a length in proportion to the amount of documents in documents associated with said each section, according to a user's selecting instruction; and the association displaying means displays said each section according to the display type selected by said display type selection means.
 16. The documentation browsing apparatus according to any one of claims 11 to 15, wherein the association displaying means displays the time length of voices or images by the length of a display bar indicating a time axis on a display screen, and displays the time length of said voices or said images and the characters in a document of a section associated with said voices or said images in the same color.
 17. The documentation browsing apparatus according to any one of claims 11 to 16, comprising: mismatch detecting means for detecting a case where voices or images and a document are not associated with each other as a mismatch state of the sounds or the images and the document, and mismatch displaying means for displaying that the voices or the images and the document mismatch when the mismatch is detected; wherein said mismatch displaying means displays, as a mismatch state, that no section of the document associated with a section of voices or images exists or that no section of voices or images associated with a section of the document exists.
 18. (canceled)
 19. The documentation browsing apparatus according to any one of claims 5 to 17, comprising document storage means for storing document data and audio and image storage means for storing audio data or image data.
 20. The documentation browsing apparatus according to claim 5, wherein the matching means establishes a matching between a document and voices or images by matching text generated by speech recognition or image recognition against a document included in document data.
 21. The documentation browsing apparatus according to claim 20, wherein the matching means matches text and a document by using a dynamic program matching system with a word appearing in the text or the document.
 22. The documentation browsing apparatus according to claim 20 or 21, wherein the association displaying means displays text associated with said document among the texts generated by the matching means in addition to a document included in the document data.
 23. The documentation browsing apparatus according to claim 20, comprising basic word extracting means for extracting a basic word to be a basic unit for matching text and a document from words included in each section of said text and said document; wherein the matching means establishes a matching between a document and voices or images by calculating and comparing similarities between groups of basic words, which are sets of the basic words extracted by said basic word extracting means.
 24. The documentation browsing apparatus according to claim 23, wherein the matching means matches text and a document by using a dynamic program matching system based on a group of basic words including the basic words extracted by the basic word extracting means.
 25. The documentation browsing apparatus according to claim 23 or 24, wherein the association displaying means displays a basic word associated with a document among the basic words extracted by the basic word extracting means in addition to said document included in the document data.
 26. The documentation browsing apparatus according to any one of claims 5 to 25, wherein the matching means comprises relationship correction means for correcting the extracted relationship according to a user's operation.
 27. (canceled)
 28. The documentation browsing apparatus according to claim 5, comprising document update determination means for determining whether the document updating means updates document data or not; wherein the relationship recalculation instruction means outputs recalculation instruction information and causes the relationship extraction means to recalculate the relationship when the document data is determined to be updated.
 29. The documentation browsing apparatus according to any one of claims 5 to 28, comprising outputting means for outputting a document based on document data.
 30. A documentation browsing robot comprising the documentation browsing apparatus described in any one of claims 5 to 29.
 31. A documentation browsing program characterized by causing a computer to execute the processes of: generating correspondence between voices or images included in audio data or image data and a document included in document data; displaying the voices or the images included in said audio data or said image data and the document included in said document data associated with each other based on said correspondence; updating said document data associated with the displayed document; displaying a displaying location of a document included in document data on a display screen in association with time information which indicates an elapsed time of voices or images; and outputting recalculation instruction information for instructing recalculation of the relationship between a document and voices or images and recalculating said relationship.
 32. The documentation browsing program according to claim 31, causing a computer to execute the processes of: dividing voices or images included in audio data or image data and a document included in document data into predetermined sections and generating correspondence between voices or images and a document for said each section; and generating, adding, correcting or deleting a document for said each section.
 33. The documentation browsing program according to claim 31, causing a computer to execute the processes of: dividing voices or images included in audio data or image data and a document included in document data into predetermined sections and generating correspondence between voices or images and a document for said each section; generating, adding, correcting or deleting a document for said each section; and playing back voices or images of a part associated with at least a section included in said document data among the voices or images included in said audio data or said image data.
 34. The documentation browsing program according to any one of claims 31 to 33, causing a computer to execute the process based on document data including a document of the minutes of a meeting, a lecture or a lesson, and audio data or image data including voices or images of the recorded contents of the meeting, the lecture or the lesson.
 35. The documentation browsing program according to any one of claims 31 to 34, causing a computer to execute the process of: extracting a starting time and an ending time of a section of voices or images and generating association information which associates said starting time and said ending time with document data.
 36. The documentation browsing program according to any one of claims 31 to 35, causing a computer to execute the process of displaying time information which indicates the elapsed time of voices or images associated with each section of document data.
 37. (canceled)
 38. The documentation browsing program according to claim 31, causing a computer to execute the process of displaying the length of displaying each section of document data on a display screen by a length in proportion to the playback time of the voices or the images associated with said each section.
 39. The documentation browsing program according to claim 31, causing a computer to execute the process of displaying the length of each section of document data on a display screen by a predetermined length.
 40. The documentation browsing program according to claim 31, causing a computer to execute the process of displaying the length of each section of document data on a display screen by a length in proportion to the amount of documents in documents associated with said each section.
 41. The documentation browsing program according to claim 31, causing a computer to execute the processes of: selecting, as the display type of the length of each section of document data on a display screen, a length in proportion to the playback time of the sounds or the images associated with said each section, or a predetermined length, or a length in proportion to the amount of documents in documents associated with said each section, according to a user's indication; and displaying said each section according to the selected display type.
 42. The documentation browsing program according to any one of claims 31 to 41, causing a computer to execute the processes of: displaying the time length of voices or images by the length of a display bar indicating a time axis on a display screen, and displaying the time length of said voices or said images and the characters in a document of a section associated with said voices or said images in the same color.
 43. The documentation browsing program according to any one of claims 31 to 42, causing a computer to execute the processes of: detecting a case where voices or images and a document are not associated with each other as a mismatch state of the voices or the images and the document, and displaying, as a mismatch state, that no section of the document associated with a section of voices or images exists or that no section of voices or images associated with a section of the document exists when said mismatch is detected.
 44. The documentation browsing program according to any one of claims 31 to 43, causing a computer to execute the processes of: establishing a matching between a document and voices or images, and generating correspondence based on said extracted relationship.
 45. The documentation browsing program according to any one of claims 31 to 44, causing a computer to execute the processes of: causing document storage means for storing document data to store document data generated based on a document inputted by a user; and causing audio and image storage means for storing audio data or image data to store audio data or image data generated from recorded dialogs.
 46. The documentation browsing program according to claim 44, causing a computer to execute the process of: establishing a matching between a document and voices or images by matching text generated by speech recognition or image recognition against a document included in document data.
 47. The documentation browsing program according to claim 46, causing a computer to execute the process of: matching text and a document by using a dynamic program matching system with a word appearing in the text or the document.
 48. The documentation browsing program according to claim 46 or 47, causing a computer to execute the process of displaying text associated with a document among the generated texts in addition to said document included in the document data.
 49. The documentation browsing program according to claim 46, causing a computer to execute the processes of: extracting a basic word to be a basic unit for matching text and a document from words included in each section of said text and said document, and establishing a matching between a document and voices or images by calculating and comparing similarities between groups of the basic words, which are sets of the extracted basic words.
 50. The documentation browsing program according to claim 49, causing a computer to execute the process of matching text and a document by using a dynamic program matching system based on a group of the basic words including the extracted basic words.
 51. The documentation browsing program according to claim 49 or 50, causing a computer to execute the process of displaying a basic word associated with said document among the extracted basic words in addition to a document included in the document data.
 52. The documentation browsing program according to any one of claims 31 to 51, causing a computer to execute the process of correcting the extracted relationship according to a user's operation.
 53. (canceled)
 54. The documentation browsing program according to any one of claims 31 to 52, causing a computer to execute the processes of: determining whether document data is updated or not; and outputting recalculation instruction information and recalculating the relationship when the document data is determined to be updated.
 55. The documentation browsing program according to any one of claims 31 to 54, causing a computer to execute the process of outputting a document based on document data.